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Clustering is one of the most widely used procedures in the analysis of mi- 
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. croarray data, for example with the goal of discovering cancer subtypes based 

on observed heterogeneity of genetic marks between different tissues. It is well- 
0^ . known that in such high-dimensional settings, the existence of many noise vari- 



ables can overwhelm the few signals embedded in the high-dimensional space. 



^ ■ We propose a novel Bayesian approach based on Dirichlet process with a spar- 

se ; 

sity prior that simultaneous performs variable selection and clustering, and 
also discover variables that only distinguish a subset of the cluster compo- 
nents. Unlike previous Bayesian formulations, we use Dirichlet process (DP) 
for both clustering of samples as well as for regularizing the high-dimensional 
mean/variance structure. To solve the computational challenge brought by 
this double usage of DP, we propose to make use of a sequential sampling 
scheme embedded within Markov chain Monte Carlo (MCMC) updates to im- 



prove the naive implementation of existing algorithms for DP mixture models. 
Our method is demonstrated on a simulation study and illustrated with the 
leukemia gene expression dataset. 

keywords: Dirichlet process; Markov chain Monte Carlo; Sequential sam- 
pling; Sparsity prior; 



1 Introduction 



Clustering is one of the most widely used procedures in the analysis of microarray 



data. It has been used, for example, for cancer subtype discovery (jGolub et al. 



19991 ) . Technological advances over the last decade on microarrays have made possible 
simultaneous investigation of thousands of genes that potentially characterize and 
distinguish previously known or unknown cancer subtypes. Although obviously not 
all the genes arrayed possess discriminative power for different cancer subtypes, if 
fewer genes are used, the procedure might fail to distinguish between some of the 
subtypes. Also, because of the cost of arraying the transcripts, this is a typical "large 
p, small n" problem that has attracted much attention recently. 



A mong many classes o f cluste ring procedures, the model-based approach (jBanfield and Raftery . 



1993 



Fraley and Raftery 



20021 ) . assuming the data come from a mixture of distri- 



butions, has the advantage of permitting principled statistical inferences compared 
to other procedures based largely on heuristics, such as k-means. This is especially 
important in our case where inferences should be made on the selected variables as 
well as clustering structure. 

Motivated by model interpretation as well as parsimony considerations, vari- 
able selection in clustering, mostly within the Bayesian framewor k, has been of in 



creasing interest. Compared to va riable selection in regression ( iTibshirani 



George and McCulloch 



199 



3 



1996 



19971 ). the clustering problem is much less studied. 
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Friedman and Meulmanl (120041 ) proposed one approach to select different subsets of 
variables a nd differen t assoc iated weights for different clusters for non-model-based 

(120031 ) proposed to first reduce dimension by performing the 



clustering. 



Liu et al. 



principal component analysis on the covariates and then fitting a Bayesian mixture 
model to the top factors. Although the number of factors is automatically determined 
by the model, there are several disadvantages to this approach, including difficulty 
in interpretation in terms of the original attributes. Also, it can be argued that 
the top principal components do not necessarily have the most signi ficant discrim- 



inative power for clustering and thus the proc edure is suboptimal. 



Tadcssc et a 



(120051 ) adapted the stochastic search strategy of iGeorge and McCullochl ( 119931 . 119971 ) 



originally proposed in the regression context and used reversible jump MCMC for 
inferences of cluster structures with simultaneous variable selection. This approach 
assumes that the same subsets of covariates discriminate all clusters. The model laid 



out in 



Kim et al. 



(120061 ) is based on the same philosophy but utilized an infini t e mix- 
ture model via the Dirichlet process (DP) mixture. On the other hand, iHoffl (120061 ) 
adopted a mean shift approach in which each cluster-specific mean deviates from the 
global base-line mean on one or more attributes that differs from cluster to cluster. 

In this paper, we propose a Bayesian model for simultaneous clustering and vari- 
able selection via DP mixture as well. Our formulation is based on the mean shift 
model (1 



that of 



Hofi, 



2006). How ever, we use a novel hierarchical sparsity prior similar to 



Lucas et al. 



(120061 ) which can improve separation of significant signals from 
noise variables and thus can lead to reduced false discoveries. Also, we use a Dirich- 
let process shrinkage approach for both high-dimensional mean and variance that 
outperforms shrinkage using a non-DP prior, typically with normal distribution for 
mean and inverse-Gamma distribution for variance. Because of this double usage of 
Dirichlet process, both for sample clustering and for covariate shrinkage, the direct 
implementation of standard DP algorithms available in the literature becomes very 



inefficient. We solve this problem by utilizing an embedded sequential sampling step 
as the proposal distribution in the Markov chain Monte Carlo (MCMC) iterations. 
In the next section, we formulate our model using the sparsity prior. Posterior com- 
putation via MCMC is discussed in detail in Section 3 where we also show how to use 
sequential sampling for efficient updating. For clarity in exposition, these two sec- 
tions only consider shifts in means. Extension to shifts in both means and variances 
is briefly considered in Section 4. Section 5 includes a simulation study as well as 
an application to the leukemia gene expression data. We conclude the article with a 
brief discussion in Section 6. 



2 Model Formulation 

In this section as well as the next, we consider the case where the clusters differ from 
each other only in terms of their respective means for some of the attributes. In our 
model we start by expressing the samples y« = (yn, . . . , yi P ), i — 1, . . . , n, as 

Vij = Hj + Hij + ay,,. €ij N(0, 1). 

In this formulation, /ij and Uj are attribute-specific mean and standard deviations 
shared by all samples. We put the following priors for them: 

fij l ~ d DP(aN(fi ,a )), 

a 1 l >t d DP([3Inv — Gamma(ao, /3q)), 

where DP(aH) is the Dirichlet process with concentration parameter a > and 
base probability measure H. In this paper, we use the notation Q { DP(aH) as 
a short form for the more rigorous ^ G, G ~ DP(aH). This might be a misuse 
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but simplifies our notation since DP appears multiple times at different places within 
our model. When a — > oo, the first expression above reduces to \ij ~ N(fi Q ,a ), for 
example. The use of Dirichlet process can be motivated from at least two point of 
views. First, it relaxes the normality assumption imposed on the components of the 
mean vector. Second, since the DP is a discrete measure, it provides a regularization 
mechanism by shrinking different parameters towards each other. 

Since the attribute specific fij and <jj are shared by all samples, t he clusterin g 



structure can on ly derive from appropriate specification on As in 



KimetaL 



Hoffj fl2006h : 



(120061 ). the clustering of samples will be determined by an infinite mixture 
of distributions via Dirichlet process mixture. Denote [i i = (p,n, . . . When it is 

intended that \x { is the mean for cluster c, i.e. sample % is assigned to cluster c, we also 
use /x c to denote the same mean vector. Although there might be some concern over 
misuse of notation, this can hardly cause any confusion in the context. The sample 
means are generated from an infinite mixture specified as the following: 



= (fHi, Hip) *'~ ' DP(tH), 



where the base measure H on /x^ can be defined through the following hierarchical 
"point-mass mixture" prior: 

Hij ~ (1-^)50 + ^^(7^(0,^)), 
7Ty ~ (1 - pj)5 + pjBeta(a, b), 
Pj ~ Beta(c,d). 

Thus in our model, not only are samples assigned to different groups (i.e., \i i = 
f^j, 1 < i 7^ 3 < n with positive probability), the nonzero components of the mean 
specific to a cluster are also clustered (i.e., /i^ = fak, 1 < j ^ k < p with positive 
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probability). In this paper, we choose to use a more parsimonious model rji = rj. 
The prior structure presented above has individual probability iTij that attribute j 
has a nonzero effect for cluster c to which the z-th sample is assigned, while the 
attribute specific parameter pj indicates the sparsity propensity of the covariate j. 
Marginalization over ir^ gives the more traditional point-mixture prior 



Kl 



" P^o + ^PjDP^N^r] 2 )). 



a + b 



Similar structur e has been used in th e regression context in 



Seoetal 



(120071 ): 



Carvalho et al 



Lucas et al. 



(120061 ) 



(120081 ). As discussed in those papers, the extended 
model is able to more adequately shrink towards zero through the induction of zeros 
for 7Tjj and thus can better separate real signals from noise and reduce false discovery. 

Finally, we describe the choice of hyperpriors and the setting of hyperparameters. 
The base measure of the DP prior for pj is set as a normal distribution with p = 
y.j, (Jq = Y^=i(V-j ~ y) 2 /P where y.j = J2i Vij/ n is the observation mean for attribute 
j and y = Y2jV-j/ n * s the overall mean of all observations. For the base measure 
of DP prior for crj, we use the vague prior Inv — Gamma(0. 5, 0.5). Similarly, the 
standard vague conjugate prior Inv — Gamma(0.5, 0.5) is also used as prior for rj 2 . 
For the four concentration parameters in the DPs, r, a, (3, 7, Gam rna(0.5, 0.5) i s used - 



Lucas et al. 



(|2006h 



as the prior. Finally, in the point-mass mixture prior, we follow 
and set a = 9, b = 1, c = 0.2, d = 199.8. 

At the end of this section, we emphasize that although our specification of Dirich- 
let process mixture model has a nested structure in that the cluster component de- 
termined by the DP mixture has a base measure on a p-dimensional vector whose 
component has as its distribution a mixture of zero-point mass an d a Dirichlet pro- 



cess, the model is entirely different from the so-called nested DP (IRodriguez et al. 



20081 ). In a nested DP, the base measure is itself a Dirichlet process. The application 
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in mind for nested DP is clustering of clinic centers with the goal of identifying groups 
of centers with similar patient outcome distributions. 

3 Posterior Computation 

Let q, 1 < i < n, be the latent class indicator associated with sample i, with the 
specific numbering of no significance, although in the presentation of the algorithm, 
we assume 1 < Cj < K when K clusters are non-empty. Similarly, c 1 ^ and cj is 
the cluster indicator for the base-line mean and variance fij and <r|. We also use 
c C j,j = l,...,p, to indicate the clustering structure for the p components of the 
mean vector specific to the c-th cluster /x c = (/i c i, ■ ■ ■ , Hc P ) with c C j = indicating 
fi C j = is generated from the zero point mass. Similarly as before, we say q,- = c C j 
if sample i is assigned to cluster c. After this augmentation of data, we can update 
each of the unknowns iterating between the following steps. 

1. For j — 1, . . . ,p, draw a new value for c 1 ^ using the following conditional prob- 
abilities 

p (°j = c H k n -j,c N ( fai ~ /^)/ n K l l v + a V n )i 

l<i<n 

P{d x j ^ cf,V7 ^ j\-) oc aN(^ (Vij ~ Vij)/ n \Vo,vl + tf/n), 

l<i<n 

where N(x\/i, a 2 ) denotes the normal density evaluated at x, u — [ji^/a 2 + 

T.i< i <n,keCt j S Vik ~ ^ k )l a k\l v ^ v = l/a% + Ei<i<„,fcec^ iC l / a l C -j,c contains 
all attribute indices other than j that are assigned to cluster c and is the 
size of Cl j c . 

Then for all c G {c^, . . . , c£}, draw fx c from N(u, v) with u = [/io/^o+X!i<i<„jec c M (^j~ 
v = l/a 2 + Ei<K n , jeC? V^, where C£ = {j : 1 < j < p, c? = c}. 
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2. For j = 1, . . . ,p, draw a new value for cj using the following conditional prob- 
abilities 



«r(«) (« + Ei<i<„4/ 2 )" + " /2 ' 



r(ao) (/?o + E,<,<„4/ 2 )°" + " /2 ' 

where T(.) is the Gamma function, = yij — fij — fiij, u = ao + n a _^ c x n/2, 
v = Po + J2i<i< n kec - z 1kl^i C-j tC contains all attribute indices other than j 
that are assigned to cluster j and n°L j c is the size of Cf^ c . 

Then for all c G {cf, ...,c£}, draw a 2 c from Jnf — Gamma(u,v) with -u = 

a +< x n/2 and v = Po + J2i<i< n ,jec- z fjl 2 where C l = 0' : 1 < 3 < P, c j = c ) 
and n? c is its size. 

Now suppose in the current iteration, all non-empty clusters associated with 
one or more samples are indicated by {1, . . . , K}. 

3. For c G {1, . . . , K}, j = 1, . . . ,p, draw ir C j from the conditional distribution 

n C j\Hcj, Pj ~ (1 - Pj)S + pjBeta(a, b + 1) if fi cj = 0, 

ir C j\/i C j, pj ~ Beta(a + 1, b) if /x C j 7^ 0. 

4. For j — 1, . . . ,p, draw pj from the conditional distribution 

~ Betaic+lUjld + K- |n,-|), 

where IT,, = {c : 7r C j > 0} is the set of nonzero probabilities 7i C j associated with 
attribute j and |IL,-| is the size of the set. 

5. In this step, we update the clustering assignments of the samples q as well as 
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the c l uster- specific mean vector n c . This basically makes use of Algorithm 7 in 
Neall (120001 ) which is reproduced here for completeness. 



(a) For i — 1, . . . , n, if q is not a singleton (i.e. q = Cj for some j 7^ i), let 
c* = K + 1 be a new cluster component with /x c . drawn from the prior 
fity ~ (1 - opj/(a + 6))5 + opj/(a + b)DP(>yN(0,ri 2 )). Accept c* with 
probability 

mm[l, 



n- 1 F{y i - 1 n Ci ) ' 
where F(y,; /x c ) = ]\ p j=l N{y ij \^i cj , a]). 

If Q is a singleton, propose c* = c, c 6 {ci,...,c n } with probability 
n-i c/in — 1) (n-i c is the number of samples excluding the i-th sample 
that are currently assigned to cluster c) and accept with probability 

. . n-l-F(y»; /*«*), 
mm[l, 



(b) For % = 1, . . . , n, if q is not a singleton, set q = c, c G {ci, . . . , c n } with 
probability 

P(cj = c|-) oc -^^F(y,; ^ c ). 
n — 1 

(c) For all c G {ci, . . . , c n }, perform a Gibbs sampling step for the components 
of A* c |{yi}, {Pj} with i G {1 < k < n : c k = c} : 

i. For j — 1, . . . , p, update c C j from the following distribution 



P{c cj = 0|-) cc (1- ^^p.OAr^lO,^ 2 /^), 
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for d e {c ck , 1 < k < p, k ^ j}, 



P(c cj = c'|) oc ^—^pjU-j^N^Xjlu, 1/v + (7j/n c ), 



a + b 



P(c C j ^ c c fcV/c ^ j\) oc — —p j jN(x j \0,T] +ajn c ), 

a + b J 

where Xj = £ i:c . =c (^ ~ Vj)l n c> u = J2f. Ccj =c'( x j A 7 ?) l v and v = 

W + £ i:Ccj=c , 1/4 

ii. For c' G {c c i, . . . , c cp }, draw a new value for /z cc / from fi C c'\{yij}, with q 
C c j c . 



Note that here we used a partially collapsed Gibbs step (Ivan Dyk and Park 



20081 ) by integrating out 



7T; 



6. Draw 7/ 2 from the conditional distribution 



T] 2 ~ 7to - Gamma((l + n»)/2, (1 + ^ /4)/ 2 ), 

(cj)eo 

where C M is the set of indices of all nonzero unique values of p C j, and n M is the 
size of the set. 

7. Draw the concentration parameter s a, /3, 7, r using the data augmentation ap- 



proach of 



Escobar and West 



(1995) 



A note is in order. All the updates above are obtaine d by s tanda rd calculations. 
In particular, step 5 is a reproduction of Algorithm 7 in iNeall (120001 ). with the only 
difference being step 5(c) where Gibbs sampling must be used since the conditional 
distribution /x c conditional on samples assigned to cluster c is not directly available 
in closed form. 
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In step 5(a) above, if q is a singleton, we draw a new value for /x c from the 
prior, which is a high-dimensional vector, and the prior distribution is a mixture of 
zero point mass and a Dirichlet process. This typically makes F(yj; fj, c ) extremely 
small, which is not surprising since n c drawn from the prior can hardly explain the 
observed sample y$ well. When the update is implemented as presented, this proposal 
is almost never accepted. To solve this problem, we successfully used a sequential 
sampling approach that proposes a new value for /x c taking into account the observed 
y,. The proposed sequential sampling step generates the new value with the following 
scheme: 

• With c — K + 1, for j = 1, . . . ,p, draw c C j from the following distribution 

P( Ccj = 0|-) oc (i--JL lPj )N{x j \0,o?/n c ), 

P(c cj = c'\-) oc —^—p j n- jjC N(x j \u, 1/v + a 2 Jn c ), d G {c ck : k ^ j}, 
a + o 

P(c cj ^ c ck \/k oc —^—p j aN(x j \0, rf + o?/n c ), 

where we conditioned on {p c k}k=ii Xj, {&k}k=i-> Pj an d V- ^ n the above we define 

X 3 = J2i: Ci =c(yij - Pj)/ n c U = Ej:c cj =c' X j/ (T ]/ V aIld V = 1 / V + E J:Cc ,=c' V^- 

• For d G {cfci, . . . , Cfcp}, draw a new value for p cc i. 

These expressions are very similar to the Gibbs step 5(c), with the important differ- 
ence that when proposing new value for c C j, only the previously sampled c c k,k < j 
are available and the update is performed sequentially. With this change of proposal 
distribution, the acceptance probability in step 5(a) should be changed to 

_ T F(yi, He*) QoiVc*), 

mm 1, ; V^tt V 

1 'n-lFfy^J Q( Mc| ) J 
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and 



min[l 



n-lF( yi ;tx c *) Q( Mc J 



respectively, where Qo is the proposal density when /x c is drawn from the prior and Q 
is the proposal density of [i c when it is drawn from the sequential sampling approach 
described above. 

Finally, for inferences on cluster-specific p arameters, such as fe and we nee d 



2000 



Tadesse et al. 



20051 ). 



to take care of the label switching problem (jStephend . 
This can be done conditionally on the number of clusters for the samples, K. We 

(hood ) for details. 



refer the readers to 



Tadesse et al 



4 Extension to Variance Shifts 

Although it is not our focus in this paper, our model can easily be extended to the 
case where groups are distinguished by both different mean and variance for one or 
more attributes. On can extend the model and write the observed data as 

Vij = /I, + + CTjUijeij, c, l '~ ' N(0, 1). 

The extra factor cr^- indicates the difference in variance for data assigned to different 
groups. Similarly to the mean shift model, we can put a hierarchical sparsity prior 

a?- ~ (1 — 7Ty)5i + -K°jDP(Klnv — Gamma(a a , (3 a )), 
tt£ ~ (1- p°)6 + p°Beta(a,b), 
p°j ~ Beta(c,d). 

Note that for variances, sparsity means many of the will be exactly equal to 
one. For the base measure, we should also choose it to be roughly centered at one for 
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identifiability, for example, we can use a c = 1.5, (3 a = 0.5 as a vague prior. 

This extension causes few extra complications on the updating strategy for poste- 
rior computation, with extra updates for <7y as well as some slight change of formula. 
The details are omitted here. We do not consider further the case with variance 
shift since one can argue that for the microarray analysis for example, the researchers 
typically focus on mean shift as a distinguishing feature of tissue subtypes. 

5 Simulation and Application 

5.1 Simulation Study 

We investigate the performance of our estimation method in a simulation study. A 
dataset containing 20 samples and 200 covariates is generated as follows. 



Vij 


= fj+j + (Tjdj, €ij ~ N(0, 1), 


[J>i,j 


= 0.25,1 < i < 5,1 <j < 5, 




= 0.1,6 < i < 10, 1 < j <5, 


[J>i,j 


= -0.1,11 <i < 15,1 <j <5, 


[J>i,j 


= -0.25, 16 < i < 20, 1 < j < 5, 




= 0.2,1 < % < 5,6 <j < 10, 


fl>i,j 


= -0.15,16 < i < 20,11 <j < 15, 




= otherwise, 




= 0.1, 1 < j < 15, 


°3 


= 0.05, otherwise. 
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The structure of pij is shown in Figure Ufa) where different values for p^ show up 
as different gray levels. Each row in the image represents a sample and each column 
represents an attribute. Only the first 50 attributes are shown. The first 5 covariates 
distinguishes across all four groups, while attributes 6-10 distinguish the first cluster 
from the others and attributes 11-15 distinguish the fourth cluster from the others. We 
use the model described in Section 2 to fit the simulated dataset. Figure [T^b) shows 
the observed data in the same format as Figure QJa). Figure [2] shows the posterior 
updates for the number of clusters identified as well as gives some indication of the 
mixing of the Markov chain. The posterior gives strong support for four clusters, with 
support for five clusters comes next. In simulation as well as real data application 
that follows, we used a burn-in period of 10, 000 updates and 40, 000 iterations after 
burn-in for inferences. The posterior estimates of p^ is shown in Figure [Tfc) as 
a matrix for the first 50 attributes only. Four clusters and the zero structures are 
clearly identified. In contrast, t he hierarchical clustering based on COSA algorithm of 



Friedman and Meulmanl (120041 ) failed to identify the true clusters, as shown in Figure 
[3] for single, average, and complete linkage. 

In Figure HI we show the posterior estimate of pj for 1 < j ' < 50, which indicates 
the contribution of the j-th attribute to cluster discrimination. The results are quite 
encouraging, with the first 15 attributes clearly identified as signal variables and 
the first 5 attributes estimated to be associated with larger values of pj, consistent 
with the simulation scheme. We can also use a simple threshold of 0.5 on posterior 
estimates of 7r^. In particular, we decide attribute j to be relevant for clustering if 
7Tjj > 0.5 for at least one i. This strategy also exactly identifies the first 15 attributes 
as significant. 

Finally, for this simulated example, using DP for pj and pij, 1 < j < p performs 
better than a normal prior (corresponding to the case with a — > oo and 7 — > 00. The 
mean squared error of pj + p^, 1 < i < 20, 1 < j < 15, under our model is 0.006, in 
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contrast wi th 0.011 when a, 7 — > 00. This is consistent with the results reported in 



Nott 



fl2008h 



Now we use additional simulated examples with various choices of the number of 
attributes p and the values for mean vectors to investigate the performance of our 
method. Besides the example above, we use the following simulation schemes. All 
examples are simulated from model j/y = [i^ + 0^ as before. 

• Example 2. Same as the example presented above, except the number of noise 
variables are increased to p = 1000. 

• Example 3. Here we have n = 20 samples and p = 50 attributes, among which 
10 attributes are informative for clustering across all four clusters. The cluster 
sizes are chosen to be 3, 3, 7, 7 respectively, with fj^j = c/4 if sample i belongs 
to cluster c, 1 < c < 4, when 1 < j < 10, and <jj = 0.1 for all j. 

• Example 4. Here n = 20 and p = 50, with /j,^ = j/50 when 1 < i < 10, 
^ = (50 - j) /50 when 11 < i < 20, and Uj = 0.1 for all j. 



Our method is compared to t wo other m odel- 



variable selection proposed in 



Hoff 



jased Bayesian c lustering methods with 



Kim et al. 



(120061 ) . For each example, we 



((20061) and : 

simulated 50 datasets. The different methods are compared using three performance 
measures. First, we compute the average mean squared errors for /ijj where we only 
consider attributes that are relevant for clustering. Also, we consider the number of 
attributes selected by each method as well as its overlap with true relevant attributes. 
Based on the existing implementation for Hoff 's approach, the attributes are selected 
such that they maximize the joint posterior distribution. Besides, all three approaches 
can identify the correct number of clusters in all examples. 

The results in Table [1] show that our approach always outperformed the other 
two for all the examples considered here in terms of mean squared error. Examples 
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1 through 3 are perhaps favorable to our approach. Example 4 was intended for a 
situation where all components for a cluster mean are distinct. Our method still 
works well in this situation in terms of mean squared error, com pared with the other 
two approaches. In terms of selected variables, the approach of iHoffl (120061 ) tends to 
select a large number of relevant variables resulting in high false discovery rate. When 
there exist attributes that only distinguish a subset of cluster comp onents, such as the 



Kim et 



al 



(2006) 



Kim et al 



situation in Examples 1 and 2, the variable selection approach of 
tends to miss some of the those covariates. Note that the approach of 
(120061 ) cannot give informations on whether an attribute only distinguishes a subset 
of the cluster components even if the attribute is selected. 



5.2 Leukemia Gene Expression Data Example 



We use the leukemia gene expression dataset (IGofub et all Il999l ) to demonstrate the 
utility of our proposed method. The training dataset contains 38 tissue samples, 
among which 11 samples are acute myeloid leukemia (AML) and the rest are acute 
lymphoblastic leukemia (ALL). The 27 ALL samples are further divided into two 
subgroups: 8 T-cell and 19 B-cell samples. The samples were arrayed with a total 
of 7129 genes in a microar ray experiment. Following the standard preprocessing 



steps in 



Dudoit et al. 



(120021 ) . we truncate the expression values to within the interval 
[1, 16000], and delete those genes whose maximum and minimum expression across 
all samples satisfies max / min < 5 and max — min < 500. Finally, we select the top 
2000 genes with the largest variances across all samples so that at the end we have 
for this dataset n = 38, p = 2000. 

We apply our proposed method to the dataset with the hyperparameters set ex- 
actly as discussed in Section 2. Convergence of the MCMC updates is invariably a 
concern in high- dimensional problem with variable selection. As a simple diagnostic, 
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two MCMC runs of 50,000 iterations with the first 10,000 as burn-in are implemented, 
with different initialization. In particular, we start one Markov chain with initially 
all samples assigned to one cluster, and another chain where each sample is assigned 
to its own separate cluster. The posterior estimates of various unknown quantities 
for the two runs shows good agreement which indicates the chains mixed well in our 
implementation. 

As shown in Figure El the posterior for this dataset put most of the support for 
the number of clusters between 3 and 9, with 6 clusters receiving the highest score. 
Conditional on K = 6, setting the threshold 0.5 for the posterior estim ates of 7r r j 
return s 872 genes. This is muc h larger than the 120 genes reported in 



( 120061 ). Previous studies, such as 



Thomas et al. 



Kimetal 



(120011 ). also demonstrated that there 



were a large number of genes differentially expressed between different tissue samples. 

For inference about the cluster structure and comparison to known tissue subtypes, 
we estimate the posterior probability of q = c, 1 < c < 6 from posterior samples con- 
ditioned on K = 6 with the help of the procedure that deals with label switching. 
Each sample is allocated to the cluster with the largest posterior probability. The 
relationship between this allocation and known tissue types are shown in Table [2J 
Under our method, one of the ALL samples is misclassified into a cluster dominated 
by AML samples, also one of the AML samples is misclassified into a cluster domi- 
nated by ALL samples. We see that the known AML and ALL-B tissue types might 
further consist of some subtypes. Using our method, we can also discover genes that 
distinguish only some subgroups. For example, among those 872 genes relevant for 
clustering only 64 of them can distinguish between ALL and AML samples without 
discriminative power for different subtypes. 
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6 Conclusion 



In this article, we propose a novel Bayesian approach to high-dimensional clustering 
with variable selection. The distinguishing features of our method include a separate 
Dirichlet process for shrinkage estimation of cluster mean, as well as a hierarchical 
point-mass structure that improves the separation of significant signal from noise vari- 
ables. We propose a sequential sampling approach in one of the updating iterations 
of the MCMC algorithm to solve the computational problem associated with the high 
dimensionality of the mean vector. 

Our approach only involves diagonal covariance matrices. It has been argued in 
other studies in both supervised and unsupervised context that for "high dimensional 
low sample size" setting, this working i ndependence assumption is more effective than 



the 



2004 



ull covariance matrix a pproach (IFraley and Raftery 



2006 



Bickel and Levina 



Tibshirani et al. 



20031 ). Generalization of our method to general covariance 



structure seems much more challenging. 



In our implementation, we choose to use Algorithm 7 in lNeall (120001 ) f or its simplic 



ity in implementation. More efficient approaches like split-merge update (Uain and Neall . 



20041 ) can also be utilized. Nevertheless, we still expect the original algorithm should 



be modified using sequential sampling instead of drawing new components from the 
prior to make the implementation feasible in practice. 
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Table 1: Simulation results for the four examples. The numbers in each cell cor- 
respond to median mean squared error, number of attributes selected and overlap 
between selected attributes and the truth, respectively. 



Methods 


Example 1 


Example 2 


Example 3 


Example 4 


Our approach 
Hoff (2006) 
Kim et al. (2006) 


0.005 15 15 
0.013 19 15 
0.018 14 14 


0.021 18 13 
0.046 26 10 
0.069 14 8 


0.012 8 8 
0.027 17 7 
0.016 11 10 


0.016 46 46 
0.025 50 49 
0.021 48 48 



Table 2: Clustering results for leukemia expression data conditional on K = 6. 





cluster from the proposed method 


samples 


1 


2 


3 


4 


5 


6 


ALL-T(8) 








8 











ALL-B(19) 





1 





6 


4 


8 


AML(ll) 


7 


3 








1 
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attributes attributes 

(b) (c) 

Figure 1: (a) Mean structure ^ for the simulated data plotted as an image, (b) 
Noisy observed data, (c) Estimated mean under our approach. 
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Figure 2: Trace plot of the number of clusters K for the simulated dataset. 
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Figure 4: Posterior estimates of pj for 1 < j < 50. 
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Figure 5: Posterior estimate of the number of clusters for the leukemia expression 
dataset. 
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